Abstract: The aim of our study was to analyze gaze fixations in recognizing facial emotional expressions in comparison 
with the spatial distribution of the areas with the greatest increase in the total (nonlocal) luminance contrast. It is hypothesized 
that the most informative areas of the image, which attract more of the observer's attention, are the areas with the greatest increase 
in nonlocal contrast. The study involved 100 university students aged 19-21 with normal vision. 490 full-face photo images were 
used as stimuli. The images displayed faces of 6 basic emotions (Ekman’s Big Six) as well as neutral (emotionless) expressions. 
Observers’ eye movements were recorded while they were recognizing the expressions of the shown faces. Then, using 
software developed by the authors, the areas with the highest (max), lowest (min), and intermediate (med) increases in the total contrast in 
comparison with the surroundings were identified in the stimulus images at different spatial frequencies. Comparative analysis of 
the gaze maps with the maps of the areas with min, med, and max increases in the total contrast showed that the gaze fixations 
in facial emotion classification tasks significantly coincide with the areas characterized by the greatest increase in nonlocal 
contrast. The results obtained indicate that facial image areas with the greatest increase in the total contrast, which are preattentively 
detected by second-order visual mechanisms, can be the prime targets of attention. 
Introduction 


The ability to recognize facial expressions is considered a component of emotional intelligence 
and plays an important part in human communication, including educational communication (Kosonogov 
V. et al., 2019; Belousova and Belousova, 2020; Budanova I., 2021). In recent years, impairments of the 
ability to perceive facial expressions have often become a specific target of therapeutic interventions 
(Skirtach et al., 2019). Contemporary research also acknowledges the genetic influence on the functioning of 
the systems involved in recognizing and experiencing emotions (Vorobyova et al., 2019). 

Previous studies generally confirm that faces are detected and perceived faster than objects of 
other categories (Liu et al., 2000; Liu, Harris and Kanwisher, 2002; Crouzet, Kirchner and Thorpe, 2010; 
Crouzet and Thorpe, 2011). Not only is a face categorized in a scene in less than 100 ms, but this time 
is also enough to form a first impression of a person (Willis and Todorov, 2006; Cauchoix et al., 2014). MEG 
studies show activation of the medial prefrontal cortex and amygdala within the first 95 ms when facial 
expressions are differentiated (Liu and Ioannides, 2010). It is suggested that the ability to quickly recognize faces is 
mediated by a special “facial module”, and the appearance of a face in the visual field automatically turns 
on this processing system (Fodor, 1983; 2000; Kanwisher, 2000; Rivolta, 2014). 

Ultra-rapid saccades can be initiated toward a face faster than toward any other object (Crouzet, 
Kirchner and Thorpe, 2010). At the same time, an eye movement is always the result of a shift of 
attention to a new area in the visual field (Theeuwes, 2014). Moreover, because faces are simply more 
effective at engaging the observer's attention than other objects, they may already have a competitive 
advantage at the preattentive stage of processing (Reddy, Wilken and Koch, 2004). Prior research 
supports the view that ultra-rapid face detection is largely determined by information extracted 
preattentively (Vuilleumier, 2002; Honey, Kirchner and VanRullen, 2008; Allen, Lien and Jardin, 2017; see 
also the review by Tamietto and de Gelder, 2010). 

Since the selection of visual content is governed by the principle of maximizing information (Bruce and 
Tsotsos, 2005), the gaze distribution in face recognition indicates the most informative areas. These 
areas have been shown to be the eyes, nose and mouth (Luria and Strauss, 1978; Mertens, Siegmund and 
Grusser, 1993; Eisenbarth and Alpers, 2011). However, at the preattentive level of visual processing, 
there are no mechanisms that are selective to facial features. The human visual system does, however, contain 
preattentive filters called second-order visual mechanisms (see the review by Graham, 2011). These 
filters are capable of highlighting areas of spatial heterogeneity in images, and it is precisely these areas 
that can be the most informative (Itti, Koch and Niebur, 1998; Itti and Koch, 2001; Gao and Vasconcelos, 
2007; Gao, Han and Vasconcelos, 2009; Hou et al., 2013). 

The existence of second-stage filters was first predicted theoretically (Babenko, 1989; 
Chubb and Sperling, 1989; Sutter, Beck and Graham, 1989) and was then repeatedly confirmed by 
numerous experimental studies (e.g. Dakin and Mareschal, 2000; Landy and Oruç, 2002; Kingdom, Prins 
and Hayes, 2003; Reynaud and Hess, 2012; Babenko and Ermakov, 2015). These mechanisms combine 
the outputs of first-order visual filters (simple striate neurons) in a certain way and respond to spatial 
modulations of brightness gradients (their contrast, orientation, or spatial frequency). 

Initially it was assumed that the targets of attention could be local heterogeneities identified 
by first-order filters (Bergen and Julesz, 1983). However, more recent evidence reveals that higher-level 
features have an advantage over lower-level features in controlling overt attention (Frey, Konig and Einhauser, 
2007; Acik et al., 2009). It now appears that the targets of attention are extended areas of 
the image that differ from their surroundings in their physical characteristics. Based on such differences, 
various models of bottom-up saliency have been proposed over the past two decades (Hou and Zhang, 
2007; Valenti, Sebe and Gevers, 2009; Perazzi et al., 2012; Wu et al., 2012; Marat et al., 2013; Xia et al., 
2015). 

The aim of our work is to analyze gaze fixations in recognizing facial emotional expressions in 
comparison with the spatial distribution of the areas with the greatest increase in the total (nonlocal) 
contrast. The research hypothesis is that the most informative areas of the facial image, which attract more 
of the observer's attention, could be the areas with the greatest increase in nonlocal contrast. 


Materials and Methods 


Participants 

The study sample consisted of 100 university students (Europeans, 59% women) aged 19 to 
21 years (mean age 20.4 ± 2.6). All participants had normal or corrected-to-normal vision and no history of 
neurological or psychiatric illness. Before the study began, all participants were informed about 
its purpose and procedure and gave written consent for voluntary participation. The study was 
approved by the local ethics committee and conducted in accordance with the ethical standards of the 
Code of Ethics of the World Medical Association (Declaration of Helsinki). 


Stimuli 

490 full-face photo images were used as stimuli. They were selected from open-access databases: 
MMI (Pantic et al., 2005), KDEF (Lundqvist, Flykt and Öhman, 1998), RaFD (Langner et al., 2010) and 
WSEFEP (Olszanowski et al., 2015). The number of male and female faces was equal (245 each). 
These were the faces of adult Caucasians. The images displayed faces of 6 basic emotions according 
to P. Ekman (anger, disgust, fear, happiness, sadness and surprise) (Ekman, 1992) and a neutral facial 
expression. We equalized the images in average brightness and RMS contrast and inscribed them into a 
conditional circle of 880 pixels in diameter (22.8 degrees of visual angle). 
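
The equalization of average brightness and RMS contrast can be illustrated with a minimal sketch. It assumes grayscale images stored as floating-point arrays and takes RMS contrast to be the standard deviation of pixel intensities. The function name, the target values and the random placeholder images are hypothetical and are not taken from the study.

```python
import numpy as np

def equalize_mean_and_rms(image, target_mean=0.5, target_rms=0.2):
    """Rescale a grayscale image to a given mean and RMS contrast.

    RMS contrast is taken here as the standard deviation of pixel
    intensities (an assumption; other definitions divide by the mean).
    The target values are illustrative only.
    """
    image = np.asarray(image, dtype=float)
    std = image.std()
    if std == 0:
        # A uniform image has no contrast that could be rescaled.
        return np.full_like(image, target_mean)
    normalized = (image - image.mean()) / std
    return normalized * target_rms + target_mean

# Placeholder data standing in for two face photographs.
rng = np.random.default_rng(0)
face_a = rng.uniform(0.0, 1.0, size=(64, 64))
face_b = rng.uniform(0.3, 0.6, size=(64, 64))
eq_a = equalize_mean_and_rms(face_a)
eq_b = equalize_mean_and_rms(face_b)
```

After this step both arrays share the same mean and the same standard deviation. Values may fall outside the displayable range, and clipping them would change the statistics again, so a real pipeline would need to choose the targets with that in mind.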


Procedure 

Participants were positioned in a head-chin rest at a distance of 60 cm from the center of the screen. 
The instructions did not require subjects to fixate their gaze prior to stimulus onset. The subjects were asked to 
recognize the emotional expression of the face shown. Images of male and female faces with different 
emotional expressions were presented in random order. The duration of the stimulus exposure 
was 700 ms. Verbal labels of all possible facial expressions appeared after each stimulus disappeared. The 
subjects responded by clicking a mouse button to indicate which emotion they thought had been shown. Prior to 
the experiment, all subjects underwent training that helped them understand the task and procedure and 
refreshed the names of the emotional expressions. Since differentiating emotions is a common task 
for an adult, prolonged training was not required. First, in free-viewing mode, subjects looked through 
photographs of men and women showing different facial expressions. Each image was accompanied by 
a caption indicating the displayed emotion. Then, in order to familiarize the subjects with the procedure 
and make sure that they understood the task correctly, several training trials were carried out. The images 
used in the training were not used in the main experiment. 

The main experiment lasted no more than 20 minutes, and the experimental task was 
not tiring. Nevertheless, because we recorded the subjects' responses as well as their eye movements, 
we were able to monitor the development of fatigue during the experiment. Comparing the percentage 
of correct answers in the first and last thirds of the experiment, we found no significant decrease in 
performance. 


Eye-tracking 

Eye movements were recorded using the SMI RED-m tracker. The standard calibration procedure 
for the device was carried out prior to each experiment. Eye position was sampled at a 
frequency of 60 Hz. The gaze localization accuracy was 30 arc minutes. For each stimulus, a fixation 
density map (FDM) was constructed by averaging over all subjects. 
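
One common way to build a fixation density map is to place a Gaussian at every fixation pooled over observers and normalize the sum. The sketch below follows that approach as an assumption, since the text does not specify how the FDMs were smoothed. The function name, the fixation coordinates and the value of `sigma_px` are hypothetical. With 880 pixels spanning 22.8 degrees, the stated accuracy of 30 arc minutes corresponds to roughly 19 pixels, which would be one plausible choice for the spread.

```python
import numpy as np

def fixation_density_map(fixations, shape, sigma_px):
    """Sum an isotropic Gaussian at every fixation and normalize to unit sum.

    fixations: iterable of (row, col) positions pooled over all observers.
    sigma_px:  Gaussian spread in pixels (hypothetical smoothing parameter).
    """
    rows, cols = np.indices(shape)
    fdm = np.zeros(shape, dtype=float)
    for r, c in fixations:
        fdm += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma_px ** 2))
    total = fdm.sum()
    return fdm / total if total > 0 else fdm

# Made-up fixations from several observers on a 100 x 100 pixel image.
pooled = [(40, 50), (42, 48), (41, 52), (70, 50)]
fdm = fixation_density_map(pooled, shape=(100, 100), sigma_px=5.0)
peak_row, peak_col = np.unravel_index(np.argmax(fdm), fdm.shape)
```

In this toy example the peak of the map falls inside the cluster of three nearby fixations rather than on the isolated one, which is the behavior expected of a density estimate.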


Digital image processing 

Using software we developed, which compares the total luminance contrast in the central operator 
window with the total contrast in the surrounding area, we identified the face image areas with the highest (max), 
lowest (min), and intermediate (med) increases in the total contrast. The med areas 
were located on a conditional straight line connecting the nearest min and max regions, and the degree 
of contrast increase in a med area was the average of those in these min and max areas. 

For digital image processing, we used a concentric operator. The operator included a central area 
(central window of the operator) and a surrounding ring (peripheral part of the operator). The width of the 
peripheral ring was equal to the diameter of the central window. First, in the center area of the concentric 
operator, we calculated the total energy of the image filtered at a frequency of 4 cycles per diameter of this 
central area. This filtering frequency was set based on the optimal ratio of carrier-envelope frequencies 
for human perception of contrast modulations (Babenko, Ermakov and Bozhinskaya, 2010; Sun and 
Schofield, 2011; Li et al., 2014). In the peripheral part of the operator, the spectral power across the entire 
range of spatial frequencies perceived by humans was calculated as an average per octave. The contrast 
modulation amplitude was equal to the difference in the spectral power calculated between the central 
and peripheral regions of the operator. 
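
The geometry of the concentric operator can be illustrated with a deliberately simplified sketch. It keeps the layout described above (a central disc and a surrounding ring whose width equals the disc diameter) but replaces the band-pass spectral power with plain RMS contrast, so it is a stand-in for the idea of a center-minus-surround contrast difference and not a reimplementation of the authors' software. The function name and the synthetic image are hypothetical.

```python
import numpy as np

def contrast_increase(image, cy, cx, diameter):
    """RMS contrast in the central disc minus RMS contrast in the ring.

    Simplification: the study compares band-limited spectral power,
    whereas this sketch uses the standard deviation of raw intensities.
    """
    rows, cols = np.indices(image.shape)
    dist = np.hypot(rows - cy, cols - cx)
    r_center = diameter / 2.0
    r_outer = r_center + diameter  # ring width equals the central diameter
    center = image[dist <= r_center]
    ring = image[(dist > r_center) & (dist <= r_outer)]
    return center.std() - ring.std()

# Synthetic example: a noisy textured disc on a uniform gray background.
rng = np.random.default_rng(1)
img = np.full((96, 96), 0.5)
rows, cols = np.indices(img.shape)
disc = np.hypot(rows - 48, cols - 48) <= 8
img[disc] = rng.uniform(0.0, 1.0, size=int(disc.sum()))

on_texture = contrast_increase(img, 48, 48, 16)      # operator centered on the disc
beside_texture = contrast_increase(img, 48, 28, 16)  # disc falls into the ring
```

The operator responds positively when the textured region fills its center and negatively when the texture lies only in its surround. Sliding it over an image and ranking the responses would give candidate max, med and min locations under this simplified measure.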

Changing the diameter of the operator’s window while maintaining the filtering frequency (4 cycles 
per window diameter) made it possible to identify these areas in 5 different ranges of spatial frequencies, 
each 1 octave wide (with center frequencies of 4, 8, 16, 32 and 64 cycles per image). The relationship between 
the operator’s diameter and the filtering frequency (the smaller the diameter, the higher the frequency) 
reflects the well-known property of second-order visual mechanisms that ensures their scale-invariant 
capabilities (Sutter, Sperling and Chubb, 1995; Kingdom and Keeble, 1999; Dakin and Mareschal, 2000; 
Landy and Oruç, 2002). 
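
The relationship between window diameter and filtering frequency can be made concrete with a short calculation. The sketch assumes that each 1-octave band is centered geometrically on its center frequency (from f/sqrt(2) to f*sqrt(2)), which is an assumption about the band edges. The 880-pixel diameter, the 4 cycles per window and the five scales come from the text; the function and constant names are hypothetical.

```python
import math

IMAGE_DIAMETER_PX = 880   # diameter of the circle enclosing each face image
CYCLES_PER_WINDOW = 4     # filtering frequency relative to the central window

def operator_scales(n_scales=5):
    """Return (windows_per_image, window_px, center_cpi, low_cpi, high_cpi) per scale."""
    scales = []
    for k in range(n_scales):
        windows_per_image = 2 ** k                     # diameter is halved at each step
        window_px = IMAGE_DIAMETER_PX / windows_per_image
        center_cpi = CYCLES_PER_WINDOW * windows_per_image
        low_cpi = center_cpi / math.sqrt(2)            # assumed lower band edge
        high_cpi = center_cpi * math.sqrt(2)           # assumed upper band edge
        scales.append((windows_per_image, window_px, center_cpi, low_cpi, high_cpi))
    return scales

scales = operator_scales()
for n, window_px, center, low, high in scales:
    print(f"{window_px:6.1f} px window -> {center:2d} cpi (band {low:.1f} to {high:.1f} cpi)")
```

Under this assumption the 16 cpi band spans about 11.3 to 22.6 cycles per image, and halving the window from 880 to 55 pixels raises the center frequency from 4 to 64 cycles per image.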

Using the largest gradient operator, in which the diameter of the central area equaled the size of the 
image, we marked one area each with the highest, lowest and intermediate modulation of the total 
contrast in every stimulus. Then, by repeatedly halving the operator's diameter, 2, 4, 8 and 16 areas were 
marked for each contrast modulation amplitude (min, med and max). At each spatial frequency, the summed 
diameter of the identified areas was equal to the diameter of the conditional circle into which the 
original image was inscribed. For each stimulus, 3 maps of the distribution of areas with the min, med and 
max modulation of contrast were constructed. These maps were a superposition of Gaussians. 
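
A map built as a superposition of Gaussians can be sketched as follows. The text does not state how the Gaussian width relates to the size of an area, so the ratio `sigma_fraction` is a hypothetical parameter, as are the function name and the example area coordinates.

```python
import numpy as np

def area_map(areas, shape, sigma_fraction=0.25):
    """Superpose one Gaussian per marked area and normalize to unit sum.

    areas:          list of (row, col, diameter_px) for the marked areas.
    sigma_fraction: Gaussian spread as a fraction of the area diameter
                    (hypothetical; not specified in the text).
    """
    rows, cols = np.indices(shape)
    result = np.zeros(shape, dtype=float)
    for r, c, diameter in areas:
        sigma = sigma_fraction * diameter
        result += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
    total = result.sum()
    return result / total if total > 0 else result

# Made-up 'max' areas at two scales on a 128 x 128 image:
# one area at the coarsest scale and two at the next scale.
max_areas = [(64, 64, 128), (40, 44, 64), (40, 84, 64)]
max_map = area_map(max_areas, shape=(128, 128))
```

With all five scales combined, each of the three maps per stimulus would contain 1 + 2 + 4 + 8 + 16 = 31 such Gaussians, with larger areas contributing broader components.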


Statistical data analysis 

The empirical maps (FDMs) were compared with the calculated theoretical maps that resulted 
from digital processing of the stimuli. To assess the similarity of the maps, two distribution-based metrics were 
used: Pearson's linear correlation coefficient (CC), which shows whether there is a linear relationship between two 
variables, and the EMD (Earth Mover’s Distance, or Wasserstein distance), a spatially robust measure that, 
unlike other similar metrics, takes into account the spatial differences between theoretical and empirical 
results (Bylinskii et al., 2018). To calculate the distance matrix, we used a Python implementation of the 
similarity metric (Pele and Werman, 2009). 
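
The two metrics can be illustrated with a small sketch. The correlation coefficient is computed over flattened maps with NumPy. For the EMD, the sketch uses the one-dimensional case, where the distance reduces to the summed absolute difference of the cumulative distributions; the two-dimensional EMD used for the maps requires a transport solver and is not reproduced here. Function names and example values are hypothetical.

```python
import numpy as np

def correlation_coefficient(map_a, map_b):
    """Pearson's linear correlation between two maps of equal shape."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

def emd_1d(hist_a, hist_b):
    """Earth mover's distance between two 1-D histograms on a unit-spaced grid.

    Both histograms are normalized to unit mass first. In one dimension the
    EMD equals the sum of absolute differences between the cumulative sums.
    """
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    return float(np.abs(np.cumsum(a - b)).sum())

# Made-up 2 x 2 maps: the second is a linear transform of the first.
theoretical = np.array([[0.0, 1.0], [2.0, 3.0]])
empirical = 2.0 * theoretical + 1.0
cc_same = correlation_coefficient(theoretical, empirical)
cc_inverted = correlation_coefficient(theoretical, -empirical)

# All mass must travel two bins, so the distance is 2.
emd_far = emd_1d([1, 0, 0], [0, 0, 1])
emd_zero = emd_1d([1, 2, 1], [2, 4, 2])
```

The example shows why the two metrics complement each other: the correlation coefficient is insensitive to linear rescaling of a map, while the EMD grows with how far the mass has to be moved and is zero only when the normalized distributions coincide.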


Results 


First, we compared the empirical maps for each of the 490 stimuli with the distribution maps of min, 
med, and max regions constructed from image areas identified in all five spatial frequency ranges. Because 
the data obtained were not normally distributed and the variances were heterogeneous, we used a 
non-parametric test. The median correlation coefficient was -0.109 for min, and 0.323 and 0.459 for med 
and max, respectively. By comparing these scores using the Kruskal-Wallis rank sum test (df = 2, 
n = 1470), it was found that the similarity of theoretical and empirical maps significantly increases with an 
increase in the modulation amplitude of the total contrast of the selected areas (p < 0.001). The median 
EMD scores for min, med, and max were 5.266, 3.371, and 3.266, respectively. It should be noted 
that a shorter EMD indicates greater similarity between theoretical and empirical maps. The 
Kruskal-Wallis rank sum test showed that this similarity significantly increases with the increase in the contrast of 
the selected areas (p < 0.001). 
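
The Kruskal-Wallis comparison of three groups can be sketched in plain Python. The statistic is computed from the rank sums of the pooled data, and for df = 2 the chi-squared tail probability has the closed form exp(-H/2). The sketch omits the correction for ties, and the scores are made-up numbers rather than values from the study; the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_three_groups(a, b, c):
    """Return (H, p) for three groups; no tie correction is applied."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)
    p = math.exp(-h / 2.0)  # chi-squared survival function for df = 2
    return h, p

# Made-up similarity scores for min, med and max maps.
min_scores = [-0.12, -0.08, -0.10]
med_scores = [0.30, 0.35, 0.28]
max_scores = [0.45, 0.50, 0.42]
h_stat, p_value = kruskal_wallis_three_groups(min_scores, med_scores, max_scores)
```

With three fully separated groups of three values each, H equals 7.2, the largest value possible for this design. With 490 stimuli per group, as in the study, much larger statistics are possible, which is why the chi-squared values should be read together with the sample size.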


Figure 1. Examples of an empirical FDM (above) and the areas with the lowest (left), intermediate 
(center) and highest (right) increases in the total contrast, highlighted at a frequency of 16 cycles per 
image. The brightness level of the selected areas reflects the probability of gaze fixation on a given area 
of the image. 


Then we performed a similar analysis separately for each of the spatial frequency ranges. At this 
stage, the empirical maps remained the same, and the calculated theoretical maps were built from the 
areas identified in a narrow range (1 octave) of spatial frequencies with a central frequency of 4, 8, 16, 32 
and 64 cycles per image. The correlation analysis results are presented in Table 1. 


Table 1. 
Median scores of correlation coefficients for maps in different ranges of spatial frequencies and the 
effect of increasing the amplitude of contrast modulation 


Cycles per image    min       med           max           Kruskal-Wallis chi-squared    p
4                   -0.089    0.294         0.361         [illegible]                   < 0.001
8                   -0.094    0.122         0.503         195.82                        < 0.001
16                  -0.019    0.330         0.474         963.63                        < 0.001
32                  -0.023    [illegible]   [illegible]   459.03                        < 0.001
64                  -0.023    [illegible]   0.137         811.64                        < 0.001

The higher the Kruskal-Wallis chi-squared scores, the more pronounced the differences between 
the compared values (in this case, the correlation coefficients). Statistical comparison of the obtained 
scores using the Kruskal-Wallis rank sum test showed that the similarity of theoretical and empirical maps 
significantly increases with an increase in the contrast modulation amplitude of the selected areas. 


Table 2. 
EMD scores in different spatial frequency ranges and the effect of increasing the contrast modulation 
amplitude 


Cycles per image    min           med           max           Kruskal-Wallis chi-squared    p
4                   [illegible]   [illegible]   3.609         1009.5                        < 0.001
8                   6.061         3.696         2.380         [illegible]                   < 0.001
16                  3.019         1.632         1.658         843.0                         < 0.001
32                  [illegible]   4.098         [illegible]   [illegible]                   < 0.001
64                  9.601         2.001         2.676         328.08                        < 0.001

The results of the EMD analysis (shown in Table 2) were consistent with the previous analysis. 
These results also support the conclusion that the higher the increase in the total contrast of the selected 
areas, the more the calculated maps coincide with the empirical FDMs. 

To clarify the results obtained at various spatial frequencies, we conducted a post-hoc pairwise 
comparison of the values obtained for min, med and max areas, using the Conover test (Tables 3 and 4). 


Table 3. 
Post-hoc analysis results for the correlation coefficients 

Cycles per image    min-med     min-max     med-max
4                   < 0.0001    < 0.0001    < 0.0001
8                   < 0.0001    < 0.0001    < 0.0001
16                  < 0.0001    < 0.0001    < 0.0001
32                  < 0.0001    = 0.14      < 0.0001
64                  < 0.0001    < 0.0001    < 0.0001


Table 4. 
Post-hoc analysis results for the EMD 

Cycles per image    min-med     min-max     med-max
4                   < 0.0001    < 1.0000    < 0.0001
8                   < 0.0001    < 0.0001    < 0.0001
16                  < 0.0001    < 0.0001    < 0.0001
32                  < 0.0001    = 0.14      < 0.0001
64                  < 0.0001    < 0.0001    < 0.0001


The post-hoc analysis showed that the relationship between facial areas with the greatest increase 
in nonlocal contrast and gaze fixations is weakened at high spatial frequencies (32 and 64 cpi). This suggests 
that low and medium spatial frequencies (4, 8 and 16 cpi) are more important for attention control when 
viewing time is limited. Higher spatial frequencies also seem able to direct the observer's attention, 
but presumably only with a longer exposure. 
Discussion 


The main goal of our study was to test the hypothesis that the most informative facial regions may 
be the regions with the greatest increase in nonlocal contrast. The results obtained showed 
that in recognizing emotions on faces the distribution of gaze fixations significantly coincides with the 
layout of areas with the greatest increase in nonlocal contrast at low and medium spatial frequencies. The 
similarity of theoretical and empirical maps significantly decreases with a decrease in the amplitude of 
the contrast modulation in the selected areas. This effect was observed when the maps were compared 
using both the correlation coefficient and the EMD. It applies both to maps that combine the selected 
areas from all five octaves and to maps constructed from the 1-octave ranges of spatial frequencies. 

As stated above, image areas with contrast modulation activate second-order 
visual mechanisms in human vision. But how can the functioning of these mechanisms be related to the 
organization of eye movements? Given that image areas that differ from their surroundings 
in their physical characteristics are more informative (Itti, Koch and Niebur, 1998; Einhauser and Konig, 
2003; Honey, Kirchner and VanRullen, 2008; Fuchs et al., 2011), it is logical to assume that the targets 
of focal attention are the areas with the greatest increase in nonlocal contrast. Spatially overlapping 
second-order visual mechanisms are able to automatically find these areas in the image at different 
levels of resolution. The increase in activation of these mechanisms is proportional to the increase in contrast 
modulation in the receptive field of the second-order filter. We assume that the more the filter is activated, 
the higher its ability to draw attention to a certain part of the visual field. As a result, the most activated 
second-order visual mechanisms become “windows” for attention. Through these windows the higher 
levels of processing receive information from the preattentive stage. 

We believe that the perception of a face goes through certain stages. When a new object, a face 
in particular, appears in the observer's field of view, perception begins with separating this object from 
the background. Since second-order visual mechanisms have receptive fields of different sizes (Sutter, 
Sperling and Chubb, 1995; Kingdom and Keeble, 1999; Dakin and Mareschal, 2000; Landy and Oruç, 
2002), it is always possible to find among them one whose field best matches the size of the 
face that has appeared. As a result, this mechanism is centered on the face. It is tuned 
to a lower spatial frequency than the other, smaller second-order visual mechanisms also involved in facial 
processing. Therefore, it has an advantage in initiating the saccade. This conclusion is based on the fact 
that ultra-rapid saccades to faces are initiated precisely by low spatial frequencies (Guyader et al., 2017). 
Thus, because the low-frequency second-order visual mechanism is centered on the face, the 
initial saccade will, with high probability, be directed towards the center of the face. This may explain 
the previously reported tendency of the first saccades to be directed to the geometric center of the presented 
image (Tatler, 2007; Bindemann, Scheepers and Burton, 2009, 2010; Atkinson and Smithson, 2020). 
Attention directed to the center of the face allows the observer to obtain general (low-frequency) information about 
the configuration of the object and to classify it as a face (Meinhardt-Injac, Persike and Meinhardt, 
2010; Cauchoix et al., 2014; Comfort and Zana, 2015). As shown in Figure 1, the averaged FDM has a 
peak in the center of the face (between the nose bridge and the mouth). Moreover, statistical data analysis 
(Tables 1 and 2) confirms that the empirical map of gaze fixations most closely matches the calculated 
max map obtained at the lowest spatial frequency. 

However, prior research, including both performance results (Leder and Bruce, 1998; Cabeza and 
Kato, 2000; Collishaw and Hole, 2000; Schwaninger, Lobmaier and Collishaw, 2002; Bombari, Mast and 
Lobmaier, 2009) and neuroimaging data (Rossion et al., 2000; Harris and Aguirre, 2008; Lobmaier et 
al., 2008; Betts and Wilson, 2009; Liu, Harris and Kanwisher, 2010), indicates that not only 
configural processing but also featural processing contributes to face recognition. A detailed (featural) description of 
faces can be performed by second-order visual mechanisms tuned to higher spatial frequencies. As 
their frequency tuning increases, these filters highlight smaller and smaller parts of the face. It is generally agreed that 
the most valuable frequency range for face recognition is from 8 to 32 cycles per face (Nasanen, 1999; 
Ruiz-Soler and Beltran, 2006; Willenbockel et al., 2010; Collin et al., 2014). As shown in Figure 1 (lower 
right corner), the areas with the greatest increase in contrast in the frequency range from 11 to 22 cpi (the 
central frequency is 16 cpi) are located in the area of the eyes and mouth, the areas that are most informative 
for the perception of faces (Butler et al., 2010; Peterson and Eckstein, 2012; Smith, Volna and Ewing, 
2016; Royer et al., 2018). Therefore, the smaller the image areas highlighted by second-order visual 
mechanisms, the more detailed the information available for analysis at higher processing levels. 
Conclusions 


The results of the study allow us to conclude that in recognizing emotional facial expressions the 
higher the luminance contrast of the facial area, the higher the probability that this area will become the 
object of the observer's attention. It was shown that gaze fixations correlate better with the regions of 
maximum modulation of nonlocal contrast, containing information from the lower half of the frequency 
spectrum. Perhaps this can be explained by the fact that in our experiments the viewing time was limited 
to 700 ms per image. This amount of time is enough to make a decision about emotional expression, 
but during this time the observer can perform only 2-4 saccades, initiated by low-frequency information. 
Increasing the exposure time would allow the observer to pay attention to the details of the perceived image 
and could strengthen the connection between gaze fixations and high-frequency information. 

In our opinion, spatial modulation of contrast in an image can be extracted by the second-order 
visual mechanisms. The more the contrast is modulated in their receptive field, the higher their activation 
is. The higher the activation, the higher the probability of drawing the attention to this area of the visual field. 
Those mechanisms that are more activated can alternately attract visual attention and initiate saccades 
towards the areas with the greatest increase in nonlocal contrast, starting with lower spatial frequencies. 

The results obtained open perspectives for new studies that could determine whether modulations 
of nonlocal contrast play a universal role in the perception not only of faces but also of other objects, and 
that could examine the role of other spatial modulations of luminance gradients (modulations of orientation or 
spatial frequency) in bottom-up visual attention control. 

The accumulation of experimental data in this field is related to the development of image 
segmentation algorithms and solving the problem of salience. New knowledge about the regularities 
and mechanisms of determining “regions of interest” will help to optimize the operations of preliminary 
processing of input information in artificial vision systems and can be useful in the development of image 
classification systems using deep learning networks. 